eviction: regression test + distinguish layer write from map insert #4005
Conversation
Test results for 9b41b5f:
debug build: 225 tests run: 215 passed, 0 failed, 10 skipped (full report)
release build: 225 tests run: 215 passed, 0 failed, 10 skipped (full report)
Should I duplicate the explainer of the race condition somewhere in the code, as a comment? If so, where?
I think we saw this in logs but probably forgot to add a note. But good that you re-hit that.
Makes me think the #3775 fix or wrapping is more needed than not.
Mostly trusting the PR description, the code matching it, and the test passing; looking good. Comments are mostly questions.
Yeah, had that thought as well.
…ote_layer" This reverts commit 3f96e00ed027487f35798dce50b15ded3e6a1baf.
72fad66
to
767c57f
Compare
Changes:
@koivunej WDYT, is the new approach better?
Still on the approval stance. I don't see a huge difference, but the witness certainly does guide towards the wanted implementation. At the same time, it cannot guard against leaving the recording out entirely, which doesn't really fit in our current types; it would require some two-step builder (see the sketch below).
But it's not like we are trying to make a fool-proof library out of this, and the only downside is disk-based eviction giving out a warning.
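For what it's worth, the two-step builder could be approximated with a typestate. Everything in this sketch (`UnrecordedLayer`, the stub `BatchedUpdates`) is a hypothetical illustration of the shape, not the PR's actual types:

```rust
use std::marker::PhantomData;

// Stub standing in for the real layer-map batch-update type; in this PR,
// holding a `&BatchedUpdates` implies the layer map lock is held.
struct BatchedUpdates<'a>(PhantomData<&'a ()>);

// Typestate: a layer whose first residence event has not been recorded yet.
struct UnrecordedLayer;

// A layer that is guaranteed to have recorded its `LayerCreate` event.
struct Layer;

impl UnrecordedLayer {
    // Consuming `self` means the only way to obtain a `Layer` (the type a
    // layer-map insert would accept) is through this method, so the
    // recording cannot be forgotten; it becomes a compile error instead.
    fn record_residence_event(self, _witness: &BatchedUpdates<'_>) -> Layer {
        // ... record the `LayerCreate` residence event here ...
        Layer
    }
}

fn main() {
    let updates = BatchedUpdates(PhantomData);
    let layer: Layer = UnrecordedLayer.record_residence_event(&updates);
    let _ = layer; // only now could it be handed to the layer map
}
```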
@koivunej in reaction to your comment on the log message, I pushed
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
I need to dismiss my earlier review. Either I have confused two PRs, or I don't really follow the latest changes. I don't have time to look at this this evening; I will need to continue later.
I preferred the non-last-minute version; having to require the rate-limited logging is a problem in itself, caused by the fact that we currently don't always have a last-access timestamp, for (what I understand to be) API design reasons. It wasn't a problem in the past, so I have already forgotten why it is a problem now.
I preferred the pre-rate-limited version at f480844^, which I already approved; the logging could be done at the call sites (for disk-usage-based eviction), and for the eviction task we could maybe handle it in some summary, if at all.
Looking at this, it will take a lot longer to understand in which case it remains `None`; I think I proposed `SystemTime::now` for that, which I can now find from the other (see the sketch below).
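If I read the `SystemTime::now` suggestion right, it amounts to something like the following (a minimal sketch; `effective_last_activity` and the surrounding names are made up for illustration):

```rust
use std::time::{Duration, SystemTime};

// Hypothetical fallback: when a layer has no recorded last-access
// timestamp, treat it as accessed "now", so the eviction check
// `now - last_activity > threshold` can never fire for it.
fn effective_last_activity(recorded: Option<SystemTime>) -> SystemTime {
    recorded.unwrap_or_else(SystemTime::now)
}

fn main() {
    let threshold = Duration::from_secs(60);
    let now = SystemTime::now();
    let last = effective_last_activity(None); // timestamp missing
    let idle = now.duration_since(last).unwrap_or(Duration::ZERO);
    assert!(idle <= threshold); // such a layer is not an eviction candidate
}
```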
We never expect to hit the log message, because we're careful to add the `record_residence_event` calls.
For disk-usage-based eviction, rate-limiting to once per iteration is fine. For the per-timeline eviction task, rate-limiting to once per iteration still leaves one message per timeline per iteration; IMO that's too much, hence the global rate limit (sketched below). The ugliness of the rate limiting is constrained to the impl, where we have a "nice" comment explaining the situation.
Does the above convince you? If not, please state what to do with the per-timeline eviction task logging. I'll implement that on top.
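For concreteness, a global rate limit of this shape could look roughly like the following (a sketch under assumed names; the actual helper in the PR may differ):

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// One process-wide rate limit shared by all timelines' eviction tasks,
// so a buggy call site produces at most one warning per period instead
// of one per timeline per iteration.
fn warn_rate_limited(period: Duration, msg: &str) {
    static LAST_WARN: Mutex<Option<Instant>> = Mutex::new(None);
    let mut last = LAST_WARN.lock().unwrap();
    let now = Instant::now();
    match *last {
        // Within the period: suppress the message.
        Some(prev) if now.duration_since(prev) < period => {}
        // First call, or period elapsed: log and reset the timer.
        _ => {
            *last = Some(now);
            eprintln!("WARN: {msg}");
        }
    }
}

fn main() {
    for _ in 0..3 {
        // Only the first call within the period actually logs.
        warn_rate_limited(Duration::from_secs(10), "layer has no last-access timestamp");
    }
}
```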
I still dislike the approach (`RateLimitClosure`, or whatever it is named after the rename) and the static mutexes, but perhaps it truly is unavoidable, or at least I don't have the energy to come up with an alternative. At least the current review comments need to be addressed.
Following Joonas's earlier suggestion in #4005 (comment)
…iction-policy-unit-tests
Updated the PR description to the final commit message.
This patch adds a regression test for the threshold-based layer eviction.
The test asserts the basic invariant that, if left alone, the residence statuses will stabilize, with some layers resident and some layers evicted.
Thereby, we cover both the last-access-time-threshold-based eviction and the "imitate access" hacks that we put in recently.
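The invariant the test checks can be summarized in code. This is an illustrative Rust sketch, not the actual test (which drives a real pageserver); `poll` and the type names are invented:

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Eq, Debug)]
enum Residence { Resident, Evicted }

// Poll residence statuses until two consecutive observations agree,
// i.e., the system has stabilized "if left alone".
fn stabilize(mut poll: impl FnMut() -> HashMap<String, Residence>) -> HashMap<String, Residence> {
    let mut prev = poll();
    loop {
        let cur = poll();
        if cur == prev {
            return cur;
        }
        prev = cur;
    }
}

fn main() {
    // Toy stand-in for querying the pageserver between eviction-task runs.
    let snapshot = HashMap::from([
        ("l0-layer".to_string(), Residence::Resident),
        ("l1-layer".to_string(), Residence::Evicted),
    ]);
    let stable = stabilize(|| snapshot.clone());
    // The stable state must contain both resident and evicted layers.
    assert!(stable.values().any(|r| *r == Residence::Resident));
    assert!(stable.values().any(|r| *r == Residence::Evicted));
}
```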
The aggressive `period` and `threshold` values revealed a subtle bug, which is also fixed in this patch.
The symptom was that, without the Rust changes of this patch, there would be occasional test failures due to `WARN ... unexpectedly downloading` log messages.
These log messages were caused by the "imitate access" calls of the eviction task. But the whole point of the "imitate access" hack was to prevent eviction of the layers that we access there.
After some digging, I found the root cause, which is the following race condition:
1. The code that writes a new layer file creates the layer object and records a residence event `LayerCreate` with the current timestamp.
2. Time passes while the writer does other work.
3. The layer is inserted into the layer map, potentially long after the `LayerCreate` event.
4. The eviction task runs and computes `now - last_activity` for each layer in the map.
The L1 layer had no chance of being accessed until after (3). So, if enough time passes between (1) and (3), then (4) will observe a layer with `now - last_activity > threshold` and evict it.
The fix is to require the first `record_residence_event` to happen while we already hold the layer map lock. The API requires a ref to a `BatchedUpdates` as a witness that we are inside a layer map lock.
That is not fool-proof; e.g., new call sites for `insert_historic` could just completely forget to record the residence event. It would be nice to prevent this at the type level. In the meantime, we have a rate-limited log message to warn us if such an implementation error sneaks in in the future.
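To make the witness idea concrete, here is a simplified sketch; the stub types below only mimic the shape of the API and are not the actual pageserver definitions:

```rust
use std::marker::PhantomData;

struct LayerMap;

// Stand-in for the batch handle returned by the layer map's update API;
// a `&BatchedUpdates` can only exist while the layer map lock is held.
struct BatchedUpdates<'a>(PhantomData<&'a mut LayerMap>);

struct LayerAccessStats;

impl LayerAccessStats {
    // The `_witness` parameter plays no runtime role; it forces callers to
    // record the first residence event from inside a layer map batch update.
    fn record_residence_event(&self, _witness: &BatchedUpdates<'_>) {
        // ... store `LayerCreate` with the current timestamp ...
    }
}

fn main() {
    let updates = BatchedUpdates(PhantomData); // i.e., inside the lock
    let stats = LayerAccessStats;
    stats.record_residence_event(&updates);
}
```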
fixes #3593
fixes #3942